This document explores some options for trained ensembles that we could start using for COVID-19 forecasting.

Overall Scores

These scores summarize model skill for each combination of base target and spatial scale.

For brevity, we'll look here at performance for a subset of the variations on "trained" approaches that we have considered. Below are the settings we're examining, and the reasons we chose them from among the alternatives.

  • We constrain the model weights to be non-negative and sum to 1, and we do not include an intercept -- that is, the ensemble is a convex combination of the component forecasts (a minimal sketch of this estimation appears after this list). A more flexible variation only enforces that the weights are non-negative and includes an intercept; overall, its performance can be slightly better for cases than the convex version's, but it seems less stable, with a lot of variation in performance across training window sizes -- and it is consistently much worse for deaths. I have stuck with the more constrained method with more stable performance.
  • Missing forecasts are mean-imputed, and the estimated weights are then redistributed according to each model's missingness level (see the second sketch after this list); this approach has limitations and needs refinement, but it has been better than performing estimation separately for each group of locations with complete data in every evaluation I've looked at.
  • We do not employ any checks of model forecasts other than the validations performed on submission. I have not looked at approaches using these checks recently, but in analyses from a few months ago they were not very helpful for trained ensembles.
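To make the convex variation concrete, below is a minimal sketch of the weight estimation, assuming Python with numpy/scipy. The array layout, the function names, and the use of average pinball (quantile) loss -- a close relative of WIS -- as the training criterion are illustrative assumptions, not the production implementation.

```python
# Minimal sketch: estimate convex ensemble weights (non-negative, sum to 1,
# no intercept) by minimizing average pinball loss over a training window.
import numpy as np
from scipy.optimize import minimize

def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss of predictions q_pred at quantile level tau."""
    diff = y - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

def fit_convex_weights(component_qs, y, taus):
    """component_qs: (n_obs, n_models, n_quantiles) array of component
    forecasts; y: (n_obs,) observed values; taus: (n_quantiles,) levels."""
    n_models = component_qs.shape[1]

    def loss(w):
        # Ensemble quantiles are a weighted average of component quantiles.
        ens = np.tensordot(w, component_qs, axes=(0, 1))  # (n_obs, n_quantiles)
        return np.mean([pinball_loss(y, ens[:, k], t) for k, t in enumerate(taus)])

    w0 = np.full(n_models, 1.0 / n_models)  # start from equal weights
    res = minimize(
        loss, w0, method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,  # non-negativity
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},  # convexity
    )
    return res.x
```

The training-set window size explored below enters simply as the number of recent weeks of (component_qs, y) pairs passed to fit_convex_weights.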
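The imputation-and-redistribution step could look roughly like the following; the proportional redistribution rule in particular is an assumption on our part, not a statement of the exact rule used.

```python
import numpy as np

def impute_and_redistribute(component_qs, weights, missing_frac):
    """component_qs: (n_obs, n_models, n_quantiles) with np.nan where a model
    did not submit; weights: (n_models,) estimated weights; missing_frac:
    (n_models,) fraction of training forecasts each model was missing."""
    # Mean-impute: replace a missing forecast with the mean of the models
    # that did submit, separately per observation and quantile level.
    col_means = np.nanmean(component_qs, axis=1, keepdims=True)
    imputed = np.where(np.isnan(component_qs), col_means, component_qs)

    # Redistribute: shrink each model's weight by how often it was missing,
    # then renormalize so the weights again sum to 1 (assumed rule).
    adjusted = weights * (1.0 - missing_frac)
    return imputed, adjusted / adjusted.sum()
```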

Within these settings, we explore variations in the training set window size (the number of past weeks of forecasts used to estimate ensemble weights).

We also consider three quantile grouping strategies, illustrated in the sketch below: "per model" weights, where each model receives a single weight shared across all quantile levels; "per quantile" weights, with a separate weight parameter for each combination of model and quantile level; and "3 groups" of quantile levels, with one weight per model for each of the three lowest levels, the three highest, and the middle ones.
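To show how the grouping strategies translate into weight parameters, the hypothetical quantile_groups helper below maps each quantile level to a weight-group index, assuming the 23 quantile levels used for the hub's death targets (case targets use a shorter list):

```python
import numpy as np

# The 23 quantile levels used for death targets.
taus = np.array([0.01, 0.025] + list(np.arange(0.05, 0.951, 0.05)) + [0.975, 0.99])

def quantile_groups(taus, strategy):
    """Return a weight-group index for each quantile level."""
    if strategy == "per model":     # a single weight per model, shared by all levels
        return np.zeros(len(taus), dtype=int)
    if strategy == "per quantile":  # a separate weight per model and level
        return np.arange(len(taus))
    if strategy == "3 groups":      # three lowest / middle / three highest
        groups = np.ones(len(taus), dtype=int)
        groups[:3], groups[-3:] = 0, 2
        return groups
    raise ValueError(f"unknown strategy: {strategy}")
```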

We compare to two "untrained" ensembles, sketched below: an equally weighted mean (ew) of the component forecasts at each quantile level, and a median at each quantile level.
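Both untrained baselines are a one-liner per quantile level; a sketch (the names here are ours):

```python
import numpy as np

def untrained_ensembles(component_qs):
    """component_qs: (n_models, n_quantiles) for one location/target/horizon."""
    ew = component_qs.mean(axis=0)         # equally weighted mean per quantile level
    med = np.median(component_qs, axis=0)  # median per quantile level
    return ew, med
```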

We perform estimation either separately for each spatial scale (National, State, and County), or jointly across the State and National levels.

The overall average scores in the tables below are computed across a set of forecasts that is comparable for all models, determined by the model evaluated with the fewest available forecasts (the one with a training-set window of 10). For incident deaths, the relative rankings of the median and mean ("ew") can change as a few weeks are added to or removed from the evaluation set. The per-week scores plotted further down are computed across a comparable set of forecasts for all models that are available within each week.

Incident Cases

National

National level mean scores across comparable forecasts for all methods:

State

State level mean scores across comparable forecasts for all methods:

County

County level mean scores across comparable forecasts for all methods:

Incident Hospitalizations

National

National level mean scores across comparable forecasts for all methods:

State

State level mean scores across comparable forecasts for all methods:

Incident Deaths

National

National level mean scores across comparable forecasts for all methods:

State

State level mean scores across comparable forecasts for all methods:

Cumulative Deaths

National

National level mean scores across comparable forecasts for all methods:

State

State level mean scores across comparable forecasts for all methods:

The high WIS for the equally weighted mean here is not a bug: one component forecast was extremely high in the upper tail, which shows up in WIS but not in the other metrics.

Plots showing scores by week

In these plots we show results for the mean, median, and the top-performing convex approach within each combination of base target and spatial scale.

For readability, we also drop the score for the unweighted mean ensemble forecast of state level cumulative deaths in the week where that method had a very high WIS.

WIS by week

MAE by week

Two-sided interval coverage by week

50%

80%

95%

Forecast Score Availability

This section displays heat maps showing score availability by date, target_variable, spatial scale, and model. In each cell, we expect to see a number of scores equal to the number of locations for the given spatial scale times the number of horizons for the given target.
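As a sketch of how these counts could be tabulated (the column names below are assumptions about the scores data, not a documented schema):

```python
import pandas as pd

def availability_counts(scores: pd.DataFrame) -> pd.DataFrame:
    """Observed score counts per heat map cell, alongside the expected count
    (locations at that spatial scale times horizons for that target)."""
    keys = ["spatial_scale", "target_variable"]
    observed = (scores.groupby(keys + ["model", "forecast_date"])
                      .size().rename("n_scores").reset_index())
    expected = (scores.groupby(keys)
                      .apply(lambda d: d["location"].nunique() * d["horizon"].nunique())
                      .rename("n_expected").reset_index())
    return observed.merge(expected, on=keys)
```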

All forecasts

County

State

National

Forecasts available for all models that are available within each combination of base target and spatial scale

Here we have subset the forecasts to those that are comparable across all models within each combination of base target and spatial scale. We expect to see the exact same score counts for all models within each plot facet. Average scores computed within a combination of base target and spatial scale will be comparable.
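A sketch of this subsetting, under the same assumed column layout as the availability sketch above (comparable_scores is a hypothetical helper name):

```python
import pandas as pd

def comparable_scores(scores: pd.DataFrame,
                      keys=("target_variable", "spatial_scale")) -> pd.DataFrame:
    """Keep only forecast cells (location/date/horizon) that were scored for
    every model within each combination of `keys`."""
    unit = ["location", "forecast_date", "horizon"]

    def complete_cells(g):
        n_models = g["model"].nunique()
        per_cell = g.groupby(unit)["model"].nunique()
        keep = per_cell[per_cell == n_models].index
        return g.set_index(unit).loc[keep].reset_index()

    return scores.groupby(list(keys), group_keys=False).apply(complete_cells)
```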

County

State

National

Forecasts available for all models that are available within each combination of base target, spatial scale, and week

Here we have subset the forecasts to those that are comparable across all models within each combination of base target, spatial scale, and week. We expect to see the exact same score counts within each column of the plot, for all models for which any forecasts are available. Average scores computed within a combination of base target, spatial scale, and forecast week will be comparable.
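In terms of the sketch above, the per-week version is the same filter with the forecast week added to the grouping keys:

```python
# Hypothetical usage of the comparable_scores helper sketched earlier,
# treating forecast_date as the forecast week.
weekly = comparable_scores(scores, keys=("target_variable", "spatial_scale", "forecast_date"))
```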

County

State

National